Capstone Project - RSNA Pneumonia Detection Challenge

Group 6-CV 1:

Contributors - Ashish Tiwari, Kanishk Sanger, Ninad Mahajan, Tushar Bisht, Avinash Kumra Sharma

Mentor - Mr. Rohit Raj

Data Source: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data

Import Packages

Exploratory Data Analysis (EDA)

Here as a part of EDA, we will:

  1. Start with an understanding of the data, with a brief on train/test labels and the respective class info
  2. Look at the first five rows of both CSVs (train and test)
  3. Identify how the classes and target are distributed
  4. Check the number of patients with 1, 2, ... bounding boxes
  5. Read and extract metadata from dicom files
  6. Perform analysis on some of the features from dicom files
  7. Check some random images from the training dataset
  8. Draw insights from the data at various stages of EDA

Reading CSVs

Images for the current stage are in the stage_2_train_images and stage_2_test_images folders. Two CSVs accompany them:

Training data: stage_2_train_labels.csv

stage_2_detailed_class_info.csv containing detailed information about the positive and negative classes in the training set

Observations from the CSVs

Based on the analysis above, some observations:

  1. The training data has a set of patientIds and bounding boxes. Bounding boxes are defined by x, y, width and height.
  2. There are multiple records for some patients. Number of duplicates in patientId = 3,543.
  3. There is also a binary target column, Target, indicating whether there was evidence of pneumonia or no definitive evidence of pneumonia.
  4. The class label contains: No Lung Opacity / Not Normal, Normal and Lung Opacity. Chest examinations with Target = 1, i.e. those with evidence of pneumonia, are associated with the Lung Opacity class.
  5. Chest examinations with Target = 0, i.e. those with no definitive evidence of pneumonia, belong to either the Normal or the No Lung Opacity / Not Normal class.
  6. About 23,286 patientIds (~87% of them) have exactly 1 bounding box, while only 13 patients have 4 bounding boxes.
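Observation 6 can be reproduced with a short sketch over the labels dataframe (the helper below is a hypothetical illustration; it counts rows per patientId, which for Target = 1 patients equals their number of bounding boxes):

```python
import pandas as pd

def box_count_distribution(train_labels):
    """Return how many patients have 1, 2, ... rows in the labels CSV."""
    per_patient = train_labels["patientId"].value_counts()
    return per_patient.value_counts().sort_index()

# Usage: box_count_distribution(train_labels)
```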

Reading Images

Images provided are stored in DICOM (.dcm) format which is an international standard to transmit, store, retrieve, print, process, and display medical imaging information. Digital Imaging and Communications in Medicine (DICOM) makes medical imaging information interoperable. We will make use of pydicom package here to read the images.

Reference= https://pydicom.github.io/pydicom/stable/old/ref_guide.html

Understanding different View Positions

As seen below, the two view positions in the training dataset are AP (Anterior/Posterior) and PA (Posterior/Anterior). These types of X-rays are used to obtain the front view. Apart from the front view, a lateral image is usually taken to complement it.

1. Posterior/Anterior (PA): In PA, the X-ray beam hits the posterior (back) part of the chest before the anterior (front) part. While obtaining the image, the patient is asked to stand with their chest against the film.

2. Anterior/Posterior (AP): At times it's not possible for radiographers to acquire a PA chest X-ray, usually because the patient is too unwell to stand. AP projection images are of lower quality than PA images, and heart size is exaggerated (cardiothoracic ratio approximately 50%).

Reference= https://www.radiologymasterclass.co.uk/tutorials/chest/chest_quality/chest_xray_quality_projection#top_1st_img


Computed Radiography (CR):

It is the digital replacement of conventional X-ray film radiography and offers enormous advantages for inspection tasks – the use of consumables is virtually eliminated and the time to produce an image is drastically shortened.

Exploring the bounding boxes for both view positions :

  1. The center of each rectangle is (x + width/2, y + height/2)
  2. We will make use of the bboxes_scatter function in the eda module.

Reference for this function & plots: https://www.kaggle.com/gpreda/rsna-pneumonia-detection-eda
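The center computation in step 1 can be sketched as follows (column names follow the labels CSV; dropping NaN rows excludes the Target = 0 records, which have no box):

```python
import pandas as pd

def box_centers(train_labels):
    """Add center coordinates for each bounding box:
    cx = x + width/2, cy = y + height/2."""
    boxes = train_labels.dropna(subset=["x", "y", "width", "height"]).copy()
    boxes["cx"] = boxes["x"] + boxes["width"] / 2
    boxes["cy"] = boxes["y"] + boxes["height"] / 2
    return boxes

# The centers can then be scatter-plotted per ViewPosition group,
# which is what bboxes_scatter does (matplotlib assumed).
```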

Exploring the bounding box centers per ViewPosition for a random sample of 1,000

Observations: BodyPartExamined & ViewPosition

Above we saw,

  1. BodyPartExamined is the same for all cases in the training dataset, CHEST, which was expected.
  2. The only value in Modality is CR, i.e. Computed Radiography.
  3. Overall, ViewPosition is almost equally distributed in the training dataset, but for cases where Target = 1, most view positions are AP.

We can make use of the pandas clip() method (Series.clip()) to trim values to a specified lower and upper threshold. So an upper threshold of 100 on PatientAge means outlier values above 100 are converted to 100.
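A minimal sketch of this clipping step; the threshold of 100 follows the text above, the helper name is ours:

```python
import pandas as pd

def clip_age(ages, upper=100):
    """Trim outlier ages to an upper threshold using pandas Series.clip;
    values above `upper` become `upper`."""
    return ages.clip(upper=upper)

# Usage: train_class["PatientAge"] = clip_age(train_class["PatientAge"])
```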

Using Binning Method for PatientAge feature

We'll make use of pd.cut, which "bins values into discrete intervals". This method is recommended when you need to segment and sort data values into bins, and it is also useful for going from a continuous variable to a categorical one. It supports binning into an equal number of bins or a pre-specified array of bins.

Reference = https://pandas.pydata.org/docs/reference/api/pandas.cut.html
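An illustrative pd.cut example; the bin edges and labels here are assumptions for demonstration, not the notebook's exact ones:

```python
import pandas as pd

ages = pd.Series([5, 23, 45, 67, 92])

# Pre-specified bin edges; intervals are right-closed by default, e.g. (0, 20].
bins = [0, 20, 40, 60, 80, 100]
labels = ["0-20", "20-40", "40-60", "60-80", "80-100"]

age_bins = pd.cut(ages, bins=bins, labels=labels)
print(age_bins.value_counts().sort_index())  # count per age bin
```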

Observations: PatientAge & PatientSex

Above we saw,

  1. For PatientAge, we saw the distribution both overall and where there was evidence of pneumonia. We used binning to check the count per age bin; the count was highest for the 40-78 age group, both overall and with pneumonia evidence.
  2. We saw the distribution of age for males and females with pneumonia evidence.
  3. The dataset had more males (57%-58%) than females (42%-43%).

Only PatientAge, PatientSex and ViewPosition are useful features from metadata.

Dropping the other features from the train_class dataframe and saving it as a pickle file
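One way this step might look; the set of protected label columns and the pickle filename are assumptions:

```python
import pandas as pd

def keep_useful_features(train_class,
                         keep=("PatientAge", "PatientSex", "ViewPosition")):
    """Drop every metadata column except the ones found useful in the EDA,
    while preserving the label/box columns from the original CSVs."""
    protected = {"patientId", "Target", "class", "x", "y", "width", "height"}
    drop_cols = [c for c in train_class.columns
                 if c not in keep and c not in protected]
    return train_class.drop(columns=drop_cols)

# Usage:
# train_class = keep_useful_features(train_class)
# train_class.to_pickle("train_class.pkl")
```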

Check some random samples from training data

Checking some random samples as below:

  1. Different classes i.e. Normal, No Lung Opacity / Not Normal and Lung Opacity
  2. Two view positions that we have in the dataset
  3. For the one with Pneumonia Evidence and age = 92

Now, we will make use of custom module (eda) and function (plot_dicom_images) already imported earlier to visualize the images.

Plotting DICOM Images

Target = 0

As the above subplots show images belonging to either "Normal" or "No Lung Opacity / Not Normal", no bounding box is observed.

Target = 1

In the above subplots, we can see that the area covered by the box (in blue) depicts the area of interest, i.e. the area in which opacity is observed in the lungs.

CNN Segmentation

Split into Train & Valid

Splitting the list of training images into train and valid sets

Randomly shuffling the list of training images
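The shuffle-and-split steps above could be sketched as follows; the validation fraction and seed are assumptions:

```python
import random

def split_train_valid(image_paths, valid_frac=0.1, seed=42):
    """Shuffle the list of training images with a fixed seed and split off
    a validation set; returns (train_paths, valid_paths)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_valid = int(len(paths) * valid_frac)
    return paths[n_valid:], paths[:n_valid]
```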

Load Pneumonia Evidence

  1. The dictionary contains one entry per patientId.

  2. If there are multiple pieces of pneumonia evidence for a patient, the dictionary contains multiple box lists tagged to that patientId key. If there's no pneumonia evidence, it holds an empty list.

Reference: https://www.kaggle.com/jonnedtc/cnn-segmentation-connected-components
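The dictionary described above could be built like this, assuming the stage_2_train_labels.csv columns (Target = 0 rows have NaN box coordinates and contribute only an empty list):

```python
import pandas as pd

def load_pneumonia_locations(train_labels):
    """Build {patientId: [[x, y, width, height], ...]}; patients with no
    pneumonia evidence map to an empty list."""
    locations = {pid: [] for pid in train_labels["patientId"].unique()}
    positives = train_labels[train_labels["Target"] == 1]
    for row in positives.itertuples():
        locations[row.patientId].append([row.x, row.y, row.width, row.height])
    return locations
```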

Plotting masks generated using the generator function, for random samples

Pneumonia Classification Model

Now, let's convert the DICOM images to PNG

Let's split the data into train, valid and test sets

Model - DenseNet121

Reference: https://stackoverflow.com/questions/41032551/how-to-compute-receiving-operating-characteristic-roc-and-auc-in-keras
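A hedged sketch of a DenseNet121 binary classifier in Keras; the input size, untrained weights (weights=None) and single-unit sigmoid head are assumptions, not the notebook's exact configuration:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

def build_densenet121(input_shape=(224, 224, 3), weights=None):
    """DenseNet121 backbone with global average pooling and a
    binary (sigmoid) classification head."""
    base = DenseNet121(include_top=False, weights=weights,
                       input_shape=input_shape, pooling="avg")
    out = layers.Dense(1, activation="sigmoid")(base.output)
    model = models.Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```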

Let's plot accuracy, average precision and F1 score with the help of this reference.
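These metrics can be computed with scikit-learn; the toy labels and probabilities below are illustrative only:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 1, 1, 0, 1])          # ground-truth targets
y_prob = np.array([0.2, 0.8, 0.3, 0.6, 0.9])  # model probabilities
y_pred = (y_prob >= 0.5).astype(int)          # hard predictions at 0.5

print("accuracy:", accuracy_score(y_true, y_pred))
print("avg precision:", average_precision_score(y_true, y_prob))
print("f1:", f1_score(y_true, y_pred))
print("roc auc:", roc_auc_score(y_true, y_prob))
```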